Structure in the Enron Email Dataset

Authors

  • P. S. Keila
  • David B. Skillicorn
Abstract

We investigate the structures present in the Enron email dataset using singular value decomposition and semidiscrete decomposition. Using word frequency profiles we show that messages fall into two distinct groups, whose extrema are characterized by short messages and rare words versus long messages and common words. It is surprising that message length and word use pattern should be related in this way. We also investigate relationships among individuals based on their patterns of word use in email. We show that word use is correlated with function within the organization, as expected. We also show that word use among those involved in alleged criminal activity may be slightly distinctive.
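As an illustration of the kind of analysis the abstract describes, the sketch below builds a small message-by-word frequency matrix and extracts its leading singular value and vectors. It is a minimal sketch, not the authors' pipeline: the tokenizer, the vocabulary cutoff, the toy messages, and the function names `word_frequency_matrix` and `leading_singular_triplet` are all assumptions made for this example, and power iteration stands in for a full singular value decomposition. Semidiscrete decomposition is not covered.

```python
import math
import re
from collections import Counter


def word_frequency_matrix(messages, vocab_size=5):
    """Return (vocab, matrix): one row per message, one column per frequent word.

    Hypothetical helper: the real preprocessing used in the paper is not
    described in the abstract.
    """
    tokenized = [re.findall(r"[a-z']+", m.lower()) for m in messages]
    totals = Counter(w for toks in tokenized for w in toks)
    # Most frequent words first; ties broken alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [w for w, _ in ranked[:vocab_size]]
    matrix = [[float(toks.count(w)) for w in vocab] for toks in tokenized]
    return vocab, matrix


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the top singular value and vectors by power iteration on A^T A."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # positive start vector
    for _ in range(iterations):
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]  # u = A v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows))        # w = A^T u
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all-zero matrix: nothing to extract
            break
        v = [x / norm for x in w]
    u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


if __name__ == "__main__":
    # Toy messages, invented for illustration only.
    toy = [
        "meeting at noon",
        "please review the attached report before the meeting",
        "ok",
    ]
    vocab, a = word_frequency_matrix(toy, vocab_size=2)
    sigma, u, v = leading_singular_triplet(a)
    print(vocab, sigma, u, v)
```

Each entry of `u` places one message along the dominant direction of variation and each entry of `v` does the same for one word. Plotting messages by their coordinates in the first few such directions is one common way to look for groupings like those the abstract reports.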

Similar resources

Analyzing the ENRON Communication Network Using Agent-Based Simulation

Agent-based modeling, simulation, and network analysis approaches are among the emergent techniques in the soft computing literature. This paper presents an agent-based model for analyzing the characteristics of peer-to-peer human communication networks. We focus on the process of the collapse of Enron Corporation, which is an interesting topic in the business management domain. The Enron ema...

Inferring Formal Titles in Organizational Email Archives

In the social network of large groups of people, such as companies and organizations, formal hierarchies with titles and lines of authority are established to define the responsibilities and order of power within that group. Although this information may be readily available for individuals within that group, the context this hierarchy provides in communications is not available to those outsid...

Introducing the Enron Corpus

A large set of email messages, the Enron corpus, was made public during the legal investigation concerning the Enron corporation. This dataset, along with a thorough explanation of its origin, is available at http://www-2.cs.cmu.edu/~enron/. This paper provides a brief introduction and analysis of the dataset. The raw Enron corpus contains 619,446 messages belonging to 158 users. We cleaned the...

Detecting Unusual and Deceptive Communication in Email

Deception theory suggests that deceptive writing is characterized by reduced frequency of first-person pronouns and exclusive words, and elevated frequency of negative emotion words and action verbs. We apply this model of deception to the Enron email dataset, and then apply singular value decomposition to elicit the correlation structure between emails. This allows us to rank emails by how wel...

Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset

Archived organizational email datasets have been considered valuable data resources for various studies, such as spam detection, email classification, Social Network Analysis (SNA), and text mining. Similar to other forms of raw data, email data can be messy and needs to be cleaned before any analysis is conducted. However, few studies have presented investigation on the cleaning of archived or...

Journal title:
  • Computational & Mathematical Organization Theory

Volume 11, Issue

Pages -

Publication date 2005